Abstract

The  application  of  a  simple  variable  frame  rate  analysis  to  the  RSRE  Airborne 
Reconnaissance  Mission  system,  a  continuous  speech  recognition  system  based  on 
phone-level  hidden  Markov  models,  is  described.  Results  are  presented  which  show 
that  performance  using  the  variable  frame  rate  technique  and  triphone  models  can 
be  better  than  that  obtained  using  triphone  models  and  full  frame  rate  data.  The 
variable  frame  rate  technique  requires  considerably  less  processing  time. 
1  Introduction


This memo describes the application of a simple VFR analysis to the RSRE Airborne
Reconnaissance Mission (ARM) system. This is a continuous speech recognition system
based on phone-level hidden Markov models (HMMs) which has been developed at the
RSRE Speech Research Unit.

In a companion paper ([9]) it was shown that variable frame rate (VFR) analysis can
be used to improve the recognition performance of certain hidden Markov models
(HMMs). For completeness the general descriptions from [9] are replicated in Sections 2, 3
and 4.

This memo concentrates on experiments which used VFR analysis on triphone models,
described  in  more  detail  in  Section  5. 

The results quoted here come from three speakers, unlike those in [9] which came from
a single speaker.


2  The  Variable  Frame  Rate  Algorithm 


This  section  will  briefly  describe  the  nature  of  the  data,  what  vfr  analysis  is,  and  its 
application  to  automatic  speech  recognition. 

Assume  that  at  any  “instant”  in  time  the  speech  signal  can  be  represented  by  an 
ordered  set  of  numbers,  or  feature  vector.  This  “instant”  is  assumed  to  be  short  enough 
that  the  properties  of  the  speech  signal  do  not  change  significantly.  Any  utterance,  or 
collection  of  words,  can  then  be  described  as  a  succession  of  feature  vectors  (sometimes 
referred  to  as  frames).  There  are  areas  where  the  speech  signal  is  relatively  constant  and 
hence  successive  feature  vectors  will  be  very  similar.  In  other  areas  the  signal  may  change 
rapidly  and  hence  successive  feature  vectors  will  be  different. 

In  order  to  reduce  the  processing  time,  one  obvious  solution  is  to  reduce  the  data 
(frame)  rate.  However,  parts  of  the  signal  which  are  changing  rapidly  contain  valuable 
information and so need to be retained. For this reason it is necessary to employ some
method  of  data  reduction  which  actually  depends  on  the  data.  Variable  frame  rate  coding 
is  such  a  technique. 


One of the first uses of VFR for data reduction in automatic speech recognition is
described in [1]. In that paper the authors describe several different VFR algorithms. This
memo uses the simplest of those.

The  vfr  algorithm  has  been  designed  to  retain  all  the  input  feature  vectors  when  they 
are  changing  most  rapidly  and  omit  a  high  proportion  when  they  are  relatively  constant.  A 
subset  of  the  feature  vectors  is  selected,  thus  avoiding  the  need  for  deciding  how  to  combine 
vectors.  All  that  is  required  is  the  calculation  of  some  measure  of  similarity  between  two 
feature  vectors  and  the  comparison  of  this  similarity  measure  with  a  threshold.  The 
most common similarity measure used is the Euclidean distance which is used in all the
experiments quoted here.

In the simple version of the algorithm the distance is computed between the last retained
feature vector and the vector under consideration. The current vector is then
omitted if the distance is less than the threshold. With this approach, a threshold less
than the minimum distance (zero) results in vectors being retained at the original frame
rate. A threshold set to the maximum distance (effectively infinity) would result in a
single vector being output, and an intermediate threshold provides a variable frame rate
dependent on the speech data.

Specifically, if D(i,j) is the distance between the previously selected frame j and the
current frame i, and the threshold is T, then the rule is to select frame i as the next output
frame if

D(i,j) > T

In some applications different thresholds are used which decrease with time so the
likelihood of outputting a frame increases with time. This application has used a single
threshold but has set an upper bound of 50 (referred to as the duplication factor) on the
number of frames which can be represented by any one output frame, thus effectively
incorporating a time constraint. This limit is only reached in long periods of silence and is
necessary to ensure that they are not completely reduced.
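
The selection rule and the duplication factor can be illustrated with a short Python sketch. It is not the code used for the experiments: the function name `vfr_select`, its arguments and the array layout (one feature vector per row) are assumptions made for illustration. Only the Euclidean distance, the rule D(i,j) > T and the limit of 50 are taken from the description above.

```python
import numpy as np

def vfr_select(frames, threshold, max_count=50):
    """Select a subset of frames using a single distance threshold.

    Returns (kept, counts): the indices of the retained frames and, for
    each retained frame, the number of input frames it represents.
    The names and the return layout are illustrative assumptions.
    """
    frames = np.asarray(frames, dtype=float)
    if len(frames) == 0:
        return [], []
    kept = [0]      # the first frame is always retained
    counts = [1]
    for i in range(1, len(frames)):
        # Euclidean distance to the last retained frame j
        dist = np.linalg.norm(frames[i] - frames[kept[-1]])
        # Retain frame i if D(i,j) > T, or if the last retained frame
        # already represents max_count frames (the duplication factor).
        if dist > threshold or counts[-1] >= max_count:
            kept.append(i)
            counts.append(1)
        else:
            counts[-1] += 1
    return kept, counts
```

With a negative threshold every frame is retained, which corresponds to the original frame rate. With a very large threshold the output is limited only by the duplication factor.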


3  Speech Representations

The speech data used were obtained by passing digitised speech signals through a 27
channel filter bank analyser at 100 frames per second. The filters are spaced on a
non-linear frequency scale based on that in [4].

As specified in [10] all the HMMs used data which had the speech spectra replaced by a
cepstral representation. The data used in this report consisted of 16 mel frequency cosine
coefficients (MFCCs) together with an overall amplitude feature.

For the results reported here, the count of frames represented by a given feature vector
after VFR analysis is appended to the vector as an additional feature. This results in 18
features for 16 MFCCs. The use of this additional feature provides a crude duration model,
in that the model mean counts will reflect the average number of frames in the original
analysis which are condensed to a single frame corresponding to the model states. The
inclusion of this extra feature was shown in [9] to be beneficial.
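
As a small illustration of the count feature, the sketch below appends the per-frame counts to the retained feature vectors. The function name `append_counts` and the assumed column order (16 MFCCs, then amplitude, then the count) are hypothetical; the text only states that the count is appended as an additional feature, giving 18 features in total.

```python
import numpy as np

def append_counts(selected, counts):
    """Append the frame count as an extra (last) feature column.

    selected: array of shape (n_frames, n_features), for example 17
              columns for 16 MFCCs plus amplitude (assumed layout).
    counts:   number of original frames represented by each row.
    """
    selected = np.asarray(selected, dtype=float)
    counts = np.asarray(counts, dtype=float).reshape(-1, 1)
    if selected.shape[0] != counts.shape[0]:
        raise ValueError("one count is needed per retained frame")
    return np.hstack([selected, counts])
```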


4  Recognition  And  Scoring 


The recognition algorithm used is a sub-word model implementation of a one-pass dynamic
programming algorithm ([2]). Whole ARM reports are processed including silences between
sentences.

Scoring  is  based  on  a  dynamic  programming  alignment  at  the  phoneme  level,  taking 
account  of  the  known  sentence  end  times,  with  subsequent  marking  of  words  according  to 
whether  their  constituent  phonemes  correctly  line  up  (cf  [8],  [3]). 


Recognition  results  are  reported  for  two  levels  of  syntactic  constraint.  All  the  phone 
results  come  from  employing  the  simple  syntax  in  which  any  sequence  of  triphones  can  be 
recognised.  The  word  results  are  obtained  from  the  word  syntax  which  allows  recognition 
of  any  sequence  of  non-speech  sounds  and  words  from  the  ARM  vocabulary. 


As  with  the  results  quoted  in  [9],  these  results  are  presented  in  terms  of  %  words  correct 
and  %  word  accuracy.  These  are  computed  as  follows,  using  dynamic  programming  to  align 
the  true  transcription  of  the  test  data  with  the  output  of  the  recogniser: 


\[
\%\ \text{words correct} = \frac{N - S - D}{N} \times 100, \qquad
\%\ \text{word accuracy} = \frac{N - S - D - I}{N} \times 100
\]


where N is the number of words in the test set, and S, D and I are the number of words
substituted  (i.e.  recognised  as  the  incorrect  word),  deleted  and  inserted  respectively.  The 
more  interesting  results  are  those  in  the  columns  headed  “word  accuracy”  since  these 
reflect  more  closely  the  level  of  performance  which  would  be  perceived  by  a  user  of  the 
system. 
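
The two measures can be sketched in Python as follows. This is a simplified illustration, not the scoring procedure used for the results: it aligns word sequences directly with a minimum edit distance, whereas the scoring described above aligns at the phoneme level and uses the known sentence end times. The function names `align_counts` and `word_scores` are assumptions.

```python
def align_counts(ref, hyp):
    """Count substitutions, deletions and insertions (S, D, I) from a
    minimum-edit-distance alignment of the reference and hypothesis."""
    # Each cell holds (total errors, S, D, I) for ref[:i] against hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            e, s, d, ins = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (e, s, d, ins)              # correct word
            else:
                best = (e + 1, s + 1, d, ins)      # substitution
            e, s, d, ins = prev[j]
            if e + 1 < best[0]:
                best = (e + 1, s, d + 1, ins)      # deletion
            e, s, d, ins = cur[j - 1]
            if e + 1 < best[0]:
                best = (e + 1, s, d, ins + 1)      # insertion
            cur.append(best)
        prev = cur
    _, s, d, ins = prev[-1]
    return s, d, ins

def word_scores(n, s, d, i):
    """Return (% words correct, % word accuracy) as defined above."""
    correct = 100.0 * (n - s - d) / n
    accuracy = 100.0 * (n - s - d - i) / n
    return correct, accuracy
```

Because insertions are subtracted only in the accuracy figure, a recogniser that inserts many spurious words can score well on % words correct while scoring poorly on % word accuracy.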

As in [10] the training and test data were distinct sets of ARM reports. Unless otherwise
stated, all the recognition results are for a 540 word test set.


5  HMMs And Triphone Models


The ARM system will not be described in full here. Further details can be found in [10],
[11].

The theory and use of sub-word hidden Markov models for automatic speech recognition
is now well established (e.g. [6]). These systems typically have distinct models
corresponding to each phoneme in the language, which are combined according to a
pronunciation dictionary to give whole word models for recognition. A large set of models is usually
used, allowing different models for a given phoneme according to its immediate phoneme
context (so-called triphone models).

The version of the ARM system described in the earlier paper ([9]) used a smaller
set of models: four models for non-speech sounds; six models of short common words¹
and sixty-one models of the phonemes in the ARM dictionary (some phonemes have two
distinct models, for syllable-initial and syllable-final consonants, which is the only context
sensitivity embodied in that model set).

¹ of, or, in, at, ait, oh (used instead of zero sometimes)
Figure 1: Topology of 3-state phone-level HMMs used in the ARM system, with null initial
and final states.


For  the  experiments  reported  here,  the  same  set  of  function  word  and  non-speech 
models  is  used,  but  full  triphone  models  are  used  in  place  of  the  phoneme  models.  (There 
are  approximately  1500  word-internal  triphones  in  the  ARM  vocabulary;  word  boundary 
triphones  are  not  used.) 

In the earlier paper ([9]), it was shown that using three states per phoneme with the
VFR analysis produced results nearly as good as the best obtained using the more expensive
duration sensitive model topology (the “TI” topology). Therefore the work reported here
uses a standard topology with three states per phoneme², and no skip transitions, as shown
in Figure 1.

All  state  output  probability  density  functions  of  HMMs  in  the  system  are  Gaussian 
with  a  diagonal  (co)variance  matrix.  The  same  variance  is  used  for  all  states  of  all  models 
in  this  version  of  the  system  (the  so-called  Grand  Variance)  in  order  to  reduce  the  total 
number  of  parameters  to  be  estimated. 
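
A state output density of this kind can be sketched as below. The function name `log_density_diag` and the argument layout are assumptions for illustration; the point is only that each state has its own mean vector while one shared variance vector (the Grand Variance) is passed in for every state.

```python
import numpy as np

def log_density_diag(x, mean, grand_var):
    """Log of a Gaussian density with a diagonal covariance matrix.

    x, mean:   feature vector and state mean.
    grand_var: per-feature variances shared by all states of all models
               (the names here are illustrative assumptions).
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    grand_var = np.asarray(grand_var, dtype=float)
    diff = x - mean
    return -0.5 * np.sum(np.log(2.0 * np.pi * grand_var) + diff ** 2 / grand_var)
```

Since the variance term is the same for every state, differences between state scores for a given frame depend only on the variance-weighted squared distances to the state means.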

Initial estimates of HMM parameters were obtained from a small quantity of speech
which had been hand labelled at the phoneme level. Standard HMM algorithms were then
used to train context insensitive phoneme models on the full training set of 36 ARM reports
(224 sentences, 1985 words), using much coarser labelling.³ These context insensitive
models are then used to provide initial values for reestimation of the corresponding triphone
models as described in [12].


6  Results 

Full results are quoted for two (male) speakers, RKM and MJR. Some of the experiments
were also repeated for the (female) speaker SRJ used in [9].

6.1  Effect  Of  Different  Thresholds  On  Data  Files 

When  applying  the  vfr  technique  to  speech  data  it  is  useful  to  know  what  sort  of  data 
reduction is being obtained. In the ARM system training and testing files are dealt
with differently. For training purposes, only the actual speech in the file is used - any
silence, between labelled phrases or sentences, is ignored. During recognition, however,
whole reports are processed, including silences (and breath noises etc) between sentences,
unlike many other systems. The data reduction obtained from using the VFR technique
on data from the speaker SRJ was shown in [9]. Similar reductions were obtained for the
other two speakers. Figure 2 shows the effect of the threshold on the number of frames
processed for two typical files for the speaker RKM.

² All models have three states except for the non-speech models, which have only a single state, and the
models for the function words “at”, “in” and “or”, which have six states.

³ The data for the two male speakers was automatically labelled in breath groups. However, for SRJ
this labelling had been done at the sentence level, combining breath groups - it is not yet known how this
difference affects performance.

Figure 2: The effect of different thresholds on numbers of frames, from typical files, processed
during training (solid line) and during testing (dotted line) for speaker RKM.

In  this  figure,  the  solid  line  shows  the  amount  of  speech  used  in  a  typical  training  file. 
The  dotted  line  shows  the  total  amount  of  speech  (including  silences)  used  in  a  typical 
testing  file.  The  more  rapid  data  reduction  at  low  thresholds  for  the  testing  file  is  due  to 
the  silences  being  discarded.  Notice  that  even  a  relatively  small  threshold  (about  300-400) 
is  capable  of  almost  halving  the  amount  of  speech  data  to  be  processed. 

As stated in Section 5 the triphone models contained one, three or six states. During
Baum-Welch reestimation for a particular utterance, a concatenated model is constructed
from the models of the constituent triphones (and function words). Given the simple model
topologies used, at least one frame of data is required for each state in the concatenated
model, otherwise reestimation fails because there are no valid paths aligning the full
sequence of states to input frames. Clearly, as the VFR threshold was increased the number
of frames in a particular utterance decreased. This resulted in two problems. Firstly, even
at quite low thresholds some of the utterances were too short to be modelled and hence
could not be used for training purposes. However, relatively few utterances were involved
(even at high thresholds) so there were still sufficient for training purposes.
The other problem arose when some of the hand labelled instances of phonemes used to
“seed” the models were too short. Difficulties only arose at a threshold of 1000 when there
were no valid hand labelled examples of several phonemes. This problem was overcome
by locating and hand labelling a few longer examples of these phonemes which were then
used in training at all thresholds. For speakers RKM and MJR, about six triphones and
two function words were involved. Because of the different annotation for speaker SRJ this
problem did not arise.
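
The minimum-length condition for training can be expressed as a simple check: with no skip transitions, every state of the concatenated model must be aligned to at least one frame. The sketch below assumes a hypothetical mapping `states_per_model` from model names to state counts (one, three or six, as described in Section 5); the model names used with it are placeholders, not ARM model names.

```python
def total_states(models, states_per_model):
    """Number of states in the concatenated model for one utterance."""
    return sum(states_per_model[m] for m in models)

def can_reestimate(num_frames, models, states_per_model):
    """True if the utterance has at least one frame per state, so that
    a valid alignment of states to frames exists."""
    return num_frames >= total_states(models, states_per_model)
```

An utterance that passes this check at the full frame rate may fail it after VFR analysis, because only `num_frames` shrinks as the threshold is raised.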


6.2  Effect  Of  Different  Thresholds  On  Processing  Times 


Figure 3: The effect of different thresholds on the processing time used in the training
phase for speakers RKM (solid line) and MJR (dotted line).

Figure 3 shows how the processing time of the training phase decreases with increasing
threshold. As expected, the shapes of these lines are very similar to those in Figure 2.⁴

Even  a  threshold  of  300  produces  a  significant  reduction  in  computing  time. 


6.3  Triphone Models And Different VFR Thresholds

The recognition results for triphone models and different thresholds are shown in
Table 1. Word accuracy results are shown in Figure 4 for the speakers RKM and MJR. It can
be seen that using VFR techniques again gives an improvement in performance, but this
increase is not as marked as in [9].

⁴ The training time is proportional to the sum, over all utterances, of the number of states times the
frame length of the utterance. The numbers of states do not change but the utterance lengths do.
VFR        Speaker  Phone        Phone        Word         Word
Threshold           correct      accuracy     correct      accuracy

0          RKM      54.1%        11.2%        [illegible]  [illegible]
100        RKM      54.8%        15.1%        [illegible]  [illegible]
200        RKM      55.7%        24.2%        93.0%        83.3%
300        RKM      55.5%        29.5%        93.1%        85.4%
350        RKM      55.7%        32.7%        92.8%        85.4%
400        RKM      54.1%        31.9%        92.6%        86.0%
500        RKM      [illegible]  36.5%        90.2%        [illegible]
600        RKM      51.3%        36.0%        [illegible]  [illegible]
700        RKM      49.1%        36.8%        83.0%        66.3%
1000       RKM      37.8%        31.3%        68.3%        35.0%

0          MJR      57.9%        16.0%        94.3%        83.0%
100        MJR      58.8%        20.9%        [illegible]  [illegible]
200        MJR      60.0%        32.3%        [illegible]  [illegible]
300        MJR      60.3%        [illegible]  [illegible]  [illegible]
350        MJR      60.0%        37.8%        [illegible]  86.5%
400        MJR      59.3%        38.6%        [illegible]  [illegible]
500        MJR      58.8%        [illegible]  [illegible]  [illegible]
600        MJR      56.4%        41.7%        [illegible]  [illegible]
700        MJR      52.4%        40.6%        87.0%        72.2%
1000       MJR      42.8%        36.8%        [illegible]  [illegible]

0          SRJ      58.7%        23.5%        95.9%        86.5%
350        SRJ      59.5%        [illegible]  [illegible]  [illegible]

Table 1: Recognition results for speakers and thresholds as shown.


Figure 5 shows that the improved performance was mainly due to there being fewer
insertions.⁵ It can be seen that the number of substitutions and deletions is virtually
constant for VFR thresholds below 400. However there is a steady decrease in the number
of insertions over the same range.

As  the  threshold  increases  above  400,  substitutions  and  deletions  begin  to  increase,  as 
do  the  insertions. 


⁵ This is only shown for speaker RKM since the corresponding graphs for MJR were virtually identical.

Figure 4: Word accuracy results for speakers RKM (solid line) and MJR (dotted line).


Figure 5: Percentage errors in word accuracy results for speaker RKM counting
substitutions plus deletions (dotted line) and substitutions plus deletions plus insertions (solid
line).


6.4  The Effect Of The DC Offset


The  speech  signals  used  have  a  large  DC  offset.  Since  one  of  the  filters  used  in  the  filter 
bank  analyser  is  centred  at  zero  frequency,  this  offset  feeds  directly  into  the  output  of  the 
analyser.  Hence,  when  the  mfcc  coefficients  are  calculated  they  are  all  affected  by  this 
offset.  This  is  not  a  very  desirable  state  of  affairs  so  experiments  were  conducted  on  the 
effect  of  removing  this  offset  by  omitting  that  channel  from  the  analyser.  These  were  only 
conducted for speakers RKM and MJR at two thresholds. The results are shown in Table 2.


VFR        Speaker  DC      Phone    Phone     Word     Word
Threshold           Offset  correct  accuracy  correct  accuracy

0          RKM      yes     54.1%    11.2%     92.0%    79.8%
0          RKM      no      55.3%    11.7%     92.4%    79.6%
350        RKM      yes     55.7%    32.7%     92.8%    85.4%
350        RKM      no      56.9%    36.2%     92.8%    84.6%
0          MJR      yes     57.9%    16.0%     94.3%    83.0%
0          MJR      no      57.9%    18.5%     92.4%    79.1%
350        MJR      yes     60.0%    37.8%     93.1%    86.5%
350        MJR      no      60.6%    39.8%     92.4%    85.4%

Table 2: Recognition results for speakers and thresholds as shown, with and without DC
offset.


From the results in Table 2 it can be seen that the only significantly different word
accuracy result from removing the DC offset is obtained for speaker MJR at a threshold of
zero, when the word accuracy decreases. In order to investigate this behaviour the means
and standard deviations of the MFCC values were studied. For a typical test file for speaker
MJR the means and standard deviations were calculated for 16 MFCCs with and without
the DC offset. These values were calculated over the true speech in the test file, i.e. there
were no silences, glitches or breath noises present, and also over the true silences in the
test file. Graphs were then drawn of the means and standard deviations for the true speech
and true silence, without the DC offset (Figure 6) and with the DC offset (Figure 7).

In both these figures, the error bars span a distance of twice the standard deviation.
The mean value is at the centre of the bar. From these figures it can be seen that without
the DC offset there is quite a lot of overlap between the values for true speech and true
silence. The standard deviations of the MFCC values for true silence are considerably
smaller than those for true speech when the DC offset is present. The effect of the DC
offset appears to be to reduce both the standard deviations of the silence MFCCs and the
overlap with the values for true speech. It is hoped to carry out further investigation into
this behaviour in the future.
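
The comparison behind Figures 6 and 7 can be sketched as follows. This is an illustrative reconstruction, not the analysis code used for the figures: the function names, the boolean `is_speech` labelling of frames and the overlap test are assumptions. The error bars are taken, as stated above, to extend one standard deviation either side of the mean.

```python
import numpy as np

def class_stats(frames, is_speech):
    """Per-coefficient mean and standard deviation for speech frames and
    for silence frames. Both classes are assumed to be non-empty."""
    frames = np.asarray(frames, dtype=float)
    mask = np.asarray(is_speech, dtype=bool)
    stats = {}
    for name, sel in (("speech", mask), ("silence", ~mask)):
        stats[name] = (frames[sel].mean(axis=0), frames[sel].std(axis=0))
    return stats

def bars_overlap(stats):
    """For each coefficient, True if the speech and silence error bars
    (mean plus or minus one standard deviation) overlap."""
    speech_mean, speech_std = stats["speech"]
    silence_mean, silence_std = stats["silence"]
    return np.abs(speech_mean - silence_mean) <= speech_std + silence_std
```

Running this on the same file analysed with and without the zero-frequency channel would give a per-coefficient view of the change in overlap described above.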
Figure 6: Means and standard deviations of MFCC values for 16 MFCC coefficients for a
test file for speaker MJR at a threshold of zero without DC offset. The bold lines show the
values for the true speech and the normal lines the values for true silence.


Figure 7: Means and standard deviations of MFCC values for 16 MFCC coefficients for a
test file for speaker MJR at a threshold of zero and with DC offset. The bold lines show the
values for the true speech and the normal lines the values for true silence.
6.5  Variance Weighting As An Alternative To VFR

It has been suggested, notably in Section 7.9 in [5], that VFR is capable of achieving
such  good  performance  because  it  downweights  steady  state  regions,  which  typically  occur 
in  long  vowels.  In  these  steady  state  regions  the  effect  of  a  relatively  minor  spectral 
difference  is  amplified  by  the  number  of  frames  over  which  it  is  repeated.  Conversely,  very 
rapid  formant  transitions  at  vowel-consonant  boundaries  may  be  crucial  in  identifying  a 
consonant  although  only  occupying  a  single  frame.  When  trying  to  “match”  an  unknown 
word  with  a  template,  undue  notice  may  be  taken  of  the  steady  state  regions,  i.e.  the 
vowels  are  matched  more  closely  than  the  consonants.  Using  VFR  analysis  counterbalances 
this  effect. 

It  has  been  suggested  ([7])  that  the  effect  of  VFR  in  downweighting  steady  state  regions 
could  be  emulated  for  non-VFR  models  and  data  by  modifying  the  variances  of  the  models. 
Increasing  the  variances  for  certain  states  (or  models)  will  tend  to  decrease  the  distances 
between  the  input  speech  and  those  states.  If  this  is  done  for  model  states  corresponding 
to  steady  state  regions,  then  the  result  will  be  to  reduce  the  per  frame  contribution  of 
discrepancies  in  those  regions  during  recognition. 

As described in Section 3 each state in a VFR model includes the mean of a feature which
reflects  the  average  number  of  frames  in  the  original  analysis  condensed  to  a  single  frame 
corresponding  to  that  state.  These  mean  counts  give  some  indication  of  how  prolonged 
the  corresponding  sounds  tend  to  be.  It  is  therefore  possible  to  weight  each  variance  in  a 
non-VFR  model  by  multiplying  by  the  corresponding  mean  count,  n,  from  the  equivalent 
vfr  model. 

Experiments  were  conducted  on  speaker  RKM  using  the  mean  counts  from  a  model 
file  created  from  data  with  a  VFR  threshold  of  350  to  modify  the  variances  of  a  model 
file  created  using  the  original  (full-rate)  data.  Initially,  all  the  three  state  models  were 
modified,  vowels  and  consonants.  However,  most  of  the  frame  reduction  achieved  by  vfr 
corresponded  to  the  centre  states  in  the  triphone  models,  therefore  only  the  centre  states 
were  modified.  It  appeared  to  be  possible  to  distinguish  the  vowels  and  consonants  by  the 
magnitude  of  these  count  mean  values  for  state  2.  Hence,  experiments  were  conducted 
where  the  variances  were  only  modified  if  the  value  of  the  mean  count  for  state  2,  n,  was 
greater than some value. Also, rather than using n as the variance multiplication factor,
various  fractions  of  it  were  used  since  the  effect  on  spectral  distances  is  not  strictly  linear. 
These  results  are  shown  in  Table  3. 
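
The weighting applied to a centre state can be sketched as below. The function name `weight_centre_variance` and its arguments are hypothetical; the sketch only illustrates the rule described above, in which the variances are multiplied by n or a fraction of n when the state 2 mean count n exceeds a chosen value.

```python
import numpy as np

def weight_centre_variance(variance, mean_count, min_count, divisor=1.0):
    """Scale the variances of a centre state by mean_count / divisor,
    but only if mean_count exceeds min_count.

    variance:   per-feature variances of the full-rate model state.
    mean_count: mean count n for state 2 of the equivalent VFR model.
    divisor:    1, 2 or 4 for scale factors n, n/2 or n/4.
    """
    variance = np.asarray(variance, dtype=float)
    if mean_count > min_count:
        return variance * (mean_count / divisor)
    return variance.copy()
```

Note that a fractional factor such as n/4 can be less than one for small n, in which case this literal reading of the rule would reduce the variance rather than increase it. How such cases were handled in the experiments is not stated.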

The  results  in  Table  3  are  significantly  worse  than  the  word  accuracy  of  85.4%  obtained 
for speaker RKM with a VFR threshold of 350. The words correct are all very similar. From
this  it  would  appear  that  vfr  cannot  be  explained  purely  in  terms  of  downweighting  the 
steady  state  regions. 




State 2      Scale    Word         Word
Mean         Factor   correct      accuracy

unmodified            92.0%        79.8%
n > 0        n        [illegible]  [illegible]
[illegible]  n        91.9%        79.1%
[illegible]  n/2      92.2%        80.2%
n > 2        n/4      91.9%        78.7%
n > 3        n        [illegible]  [illegible]
n > 3        n/2      92.0%        79.6%
n > 4        n        [illegible]  [illegible]
n > 4        n/2      92.0%        80.0%

Table 3: Recognition results for speaker RKM and using the mean count for state 2 to
modify the variances as shown.


7  Conclusions 


It  has  been  shown  that  vfr  analysis  can  be  successfully  used  within  the  ARM  system 
without  compromising  performance  levels. 

It  has  been  shown  that  vfr  analysis  can  be  used  with  triphone  models  with  a  consistent 
(though  not  necessarily  significant)  gain  in  performance  at  low  thresholds.  Even  at  low 
thresholds  the  amount  of  processing  time  is  significantly  reduced. 


8  Future  Work 

All  these  results  were  obtained  using  data  analysed  at  100  frames  per  second.  However, 
[1]  suggested  that  the  data  should  be  analysed  at  200  frames  per  second.  Future  work  will 
investigate  the  effect  of  higher  initial  frame  rates  on  the  recognition  performance  obtained 
after  VFR  analysis. 

It  is  hoped  to  carry  out  more  experiments  to  investigate  the  effect  of  the  DC  offset  on 
the  data,  and  hence  on  the  models  created. 